This paper presents the status of the APEmille project, which is essentially complete as far as machine
development and construction are concerned. Several large installations of APEmille are in use for physics
production runs, leading to many new results presented at this conference. This paper briefly summarizes the
APEmille architecture, reviews the status of the installations and presents some performance figures for physics codes.
1. OVERVIEW

The APEmille project was started with the goal of developing and
commissioning a massively parallel computer optimized for lattice
gauge theory (LGT), with a peak performance close to the 1 TFlops
threshold [1]. APEmille is the third generation of APE machines [2].
The project is now completed, as far as hardware development and
construction are concerned: several APEmille systems have been
installed at various sites, providing an overall processing power
close to 1.5 TFlops. These systems have become the main workhorse
for LGT simulations for several research groups. In this paper, we
briefly recall the features of APEmille, present the status of the
project, describe the final shape of the software environment and
discuss the performance achieved by physics programs.

2. APEmille ARCHITECTURE 

APEmille is based on a three-dimensional mesh of nodes connected by
a synchronous communication network linking first neighbours. All
nodes in the mesh operate in Single Instruction Multiple Data (SIMD)
mode. Each node consists of a 0.5 GFlops custom-developed
floating-point processor and 32 MB of data memory. At each clock
cycle the processor is able to complete the operation a × b + c
following the IEEE standard, where a, b, c are single-precision
complex or double-precision real operands. APEmille is a highly
modular and scalable system. The basic building block and smallest
independently running entity is a processing board (PB) with
2 × 2 × 2 nodes. Up to 16 PBs are interconnected via a single
backplane to form an APEmille crate. One crate has 128 processors,
approximately 65 GFlops peak performance and 4 GB data memory.
Larger systems, which can be re-partitioned by software, are
assembled by connecting n crates together. The corresponding
topology is 2n × 8 × 8.

Figure 1. Three APEmille racks.

Table 1

Key parameters of APEmille.

Parameter             Value
Peak performance      528 MFlops/proc
Clock frequency       66 MHz
FP registers          512 (32-bit)
Data memory           32 MByte/proc
Communication BW      66 MByte/s/direction
I/O BW per master     6 MByte/s
Power consumption     28 W/GFlops
Price                 2.5 Euro/MFlops peak

APEmille systems are connected to a network of host PCs running the
Linux operating system, each hosting 4 PBs. The network contains one
or more master PCs on which users log in to start their application
programs. Input/output is usually performed onto disks belonging to
the master. This I/O setup is limited by the performance of the
network connecting the PCs (typically FastEthernet). The measured
bandwidth is of the order of 6 MBytes/sec. Higher I/O rates can be
achieved by hooking 'local' disks directly to the host PCs with
Ultra2 SCSI channels. In this setup the bandwidth scales with the
size of the system. Typically it is of the order of 100 MBytes/sec
per crate.
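The per-node and per-crate figures quoted in this section follow
from a few lines of arithmetic. The sketch below is only an
illustration in Python: it assumes that one complex a × b + c counts
as 8 floating-point operations (4 multiplications and 4 additions),
which is consistent with 528 MFlops at 66 MHz, and all function and
constant names are made up for this example.

```python
# Illustrative arithmetic only; names are hypothetical, not APEmille software.
CLOCK_MHZ = 66            # clock frequency (Table 1)
FLOPS_PER_CYCLE = 8       # assumption: one complex a*b+c = 4 mul + 4 add
NODES_PER_PB = 2 * 2 * 2  # nodes on one processing board
PBS_PER_CRATE = 16        # processing boards per crate
MEM_PER_NODE_MB = 32      # data memory per node


def node_peak_mflops():
    """Peak performance of one node in MFlops."""
    return CLOCK_MHZ * FLOPS_PER_CYCLE


def crate_summary():
    """Node count, peak performance and memory of one crate."""
    nodes = NODES_PER_PB * PBS_PER_CRATE
    return {
        "nodes": nodes,
        "peak_gflops": nodes * node_peak_mflops() / 1000,
        "memory_gb": nodes * MEM_PER_NODE_MB / 1024,
    }


def topology(n_crates):
    """Mesh dimensions of a system built from n_crates crates."""
    return (2 * n_crates, 8, 8)


if __name__ == "__main__":
    print(node_peak_mflops())
    print(crate_summary())
    print(topology(2))
```

With these assumptions a node peaks at 528 MFlops and a crate at
about 67.6 GFlops with 4 GB of memory, close to the approximately
65 GFlops quoted in the text; a two-crate system forms a 4 × 8 × 8
mesh of 256 nodes.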

Power consumption of APEmille systems is very low (less than
30 W/GFlops) and the footprint of a two-crate rack is about 0.7 m².
For these reasons, APEmille machines are simply air cooled and do
not need complex infrastructure.

3. PROJECT STATUS 

After prototypes were assembled and tested in late 1999, a first
round of production was carried out in 2000. A second and final
round of production was started in spring 2001. APEmille systems
have been installed at several sites belonging to the APEmille
collaboration and also at a few more universities and research labs
(see Tab. 2). The presently installed peak processing power is about
1460 GFlops and will grow to about 2.2 TFlops once the final round
of construction is completed.
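As a quick consistency check, the per-site figures of Table 2 can be
added up. The sketch below simply transcribes that table into Python
(the dictionary name is illustrative) and confirms that the
projected total matches the roughly 2.2 TFlops quoted above.

```python
# Peak performance per site in GFlops, transcribed from Table 2.
installations_gflops = {
    "Rome I": 650,
    "DESY/NIC Zeuthen": 540,
    "Pisa": 260,
    "Rome II": 260,
    "Milano": 130,
    "Bari": 65,
    "Uni. Paris Sud": 16,
    "Bielefeld": 145,
    "Swansea": 80,
    "INFN-LNGS": 65,
}

total_gflops = sum(installations_gflops.values())
print(f"Total: {total_gflops} GFlops = {total_gflops / 1000:.1f} TFlops")
```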

APEmille is a very stable system: up-times of 1-2 months or more are
routinely achieved. Hardware maintenance is typically limited to
simple replacement of ageing modules and is therefore cheap, both in
terms of hardware costs and manpower.

4. SOFTWARE AND PERFORMANCE 

APEmille systems use the familiar TAO programming language, already
used in all previous APE machines. The language has been extended by
very few elements to exploit the new features of APEmille, like
double precision and local integer data types, and the increased
number of registers. Old APE100 programs can be re-compiled for the
new machine with almost no changes. High performance, however, is
achieved only after some tuning of the codes. To this end the TAO
programmer still has to follow only a few optimisation guidelines.

Table 2

APEmille installations by the end of 2001. For each site the total
peak performance is given. Sites in italic refer to institutions not
belonging to the APE collaboration.

Site                  GFlops
Rome I                   650
DESY/NIC Zeuthen         540
Pisa                     260
Rome II                  260
Milano                   130
Bari                      65
Uni. Paris Sud            16
Bielefeld                145
Swansea                   80
INFN-LNGS                 65
Total                   2211

Typically, the efficiency is limited by latencies in local memory
accesses and by both latencies and bandwidth in remote data
transfers. Good performance therefore requires some care to hide the
latencies and to have data already available in registers when they
are needed by the floating-point unit.

The steps to boost efficiency include:

• Intensive use of registers to prefetch data.

• Flagging memory accesses which are known to be always local to
the compiler, so that a more aggressive scheduling can be employed.

• Making memory accesses in large bursts of data (say up to 36-96
complex numbers in a single burst) and performing complex
conjugation "on the fly" when loading the data.

• Use of the APEmille intrinsic 64-bit floating-point format in
precision-critical parts of the code.
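The benefit of prefetching data in bursts can be illustrated with a
toy cost model. The sketch below is written in Python rather than
TAO, is not derived from APEmille measurements, and uses
hypothetical cycle counts; it only assumes that the load of the next
burst can fully overlap with the arithmetic on the current one.

```python
# Toy model of latency hiding; all cycle counts are hypothetical.

def cycles_without_overlap(n_bursts, load_cycles, compute_cycles):
    """Each burst is loaded and then processed; the FPU idles during loads."""
    return n_bursts * (load_cycles + compute_cycles)


def cycles_with_prefetch(n_bursts, load_cycles, compute_cycles):
    """Burst i+1 is loaded into spare registers while burst i is processed."""
    if n_bursts == 0:
        return 0
    # The first load cannot be hidden; afterwards each step costs the
    # slower of "load next burst" and "compute current burst".
    steady = (n_bursts - 1) * max(load_cycles, compute_cycles)
    return load_cycles + steady + compute_cycles


def fpu_efficiency(n_bursts, compute_cycles, total_cycles):
    """Fraction of all cycles in which the FPU does useful work."""
    return n_bursts * compute_cycles / total_cycles


if __name__ == "__main__":
    n, load, compute = 100, 40, 60  # hypothetical values
    plain = cycles_without_overlap(n, load, compute)
    hidden = cycles_with_prefetch(n, load, compute)
    print(plain, fpu_efficiency(n, compute, plain))
    print(hidden, fpu_efficiency(n, compute, hidden))
```

In this model the efficiency rises from 60% to about 99% for the
chosen numbers, provided the compute time per burst is at least as
long as the load time; when loads dominate, prefetching alone cannot
remove the bottleneck.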

Using the steps described above to optimize a bi-conjugate gradient
solver for the improved Wilson-Dirac operator, the following
performance numbers have been obtained:

• The main loop in the most time-consuming kernel of the code
achieves a pipeline filling of more than 80%.

• Computations with SU(3) matrices (clover term) run at 380 MFlops
per node.

• The full inverter runs at 200 MFlops sustained (distributed
lattice with local volume 6 × 3 × 3 × 64 on each node).
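To put these numbers in perspective, they can be expressed as a
fraction of the 528 MFlops peak per node listed in Table 1. The
short Python sketch below does this, assuming that the 200 MFlops
quoted for the full inverter is also a per-node figure; the function
name is illustrative.

```python
PEAK_MFLOPS_PER_NODE = 528  # peak performance per node (Table 1)


def fraction_of_peak(sustained_mflops, peak_mflops=PEAK_MFLOPS_PER_NODE):
    """Sustained performance as a fraction of the peak."""
    return sustained_mflops / peak_mflops


# Lattice sites held by each node for the quoted local volume.
local_sites = 6 * 3 * 3 * 64

if __name__ == "__main__":
    for label, mflops in (("clover term", 380), ("full inverter", 200)):
        print(f"{label}: {100 * fraction_of_peak(mflops):.0f}% of peak")
    print(f"local volume: {local_sites} sites per node")
```

Under this assumption the clover term reaches about 72% of peak and
the full inverter about 38%, with 3456 lattice sites per node.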

5. CONCLUSIONS 

With the completion of the last APEmille systems at various sites by
the end of this year, more than 2 TFlops of overall compute power
will be available to the LGT community. APEmille has become a main
and reliable workhorse for many groups. System and compiler software
are very stable, but further adjustments may bring additional
improvements in performance and user-friendliness. While the
APEmille project has basically been finished, the development of a
next generation of APE machines is in progress.